Chimeric RNAs comprise exons from two or more different genes and have the potential to encode novel proteins that
alter cellular phenotypes. To date, numerous putative chimeric transcripts have been identified among the ESTs isolated 
from several organisms and using high throughput RNA sequencing. The few corresponding protein products that have 
been characterized mostly result from chromosomal translocations and are associated with cancer. Here, we systemati- 
cally establish that some of the putative chimeric transcripts are genuinely expressed in human cells. Using high 
throughput RNA sequencing, mass spectrometry experimental data, and functional annotation, we studied 7424 putative 
human chimeric RNAs. We confirmed the expression of 175 chimeric RNAs in 16 human tissues, with an abundance 
varying from 0.06 to 17 RPKM (Reads Per Kilobase per Million mapped reads). We show that these chimeric RNAs are 
significantly more tissue-specific than non-chimeric transcripts. Moreover, we present evidence that chimeras tend to 
incorporate highly expressed genes. Despite the low expression level of most chimeric RNAs, we show that 12 novel 
chimeras are translated into proteins detectable in multiple shotgun mass spectrometry experiments. Furthermore, we 
confirm the expression of three novel chimeric proteins using targeted mass spectrometry. Finally, based on our functional 
annotation of exon organization and preserved domains, we discuss the potential features of chimeric proteins with 
illustrative examples and suggest that chimeras significantly exploit signal peptides and transmembrane domains, which 
can alter the cellular localization of cognate proteins. Taken together, these findings establish that some chimeric RNAs are 
translated into potentially functional proteins in humans. 



[Supplemental material is available for this article.] 

Chimeric mRNAs are distinct from conventionally spliced mRNA 
isoforms as they are produced by joining exons from two or more 
different gene loci (Pirrotta 2002; Horiuchi and Aigaki 2006; 
Robertson et al. 2007; Li et al. 2008; Gingeras 2009; Douris et al. 
2010; Herai and Yamagishi 2010; McManus et al. 2010a,b; Pettitt 
et al. 2010; Allen et al. 2011). In humans, chimeric transcripts are 
generated in several ways: by trans-splicing of pre-mRNAs (Gingeras
2009; Li et al. 2009c), by RNA transcription runoff (Akiva et al. 2006;
Parra et al. 2006), or by other errors in RNA transcription processing
(Gingeras 2009); some may instead represent artifacts of RNA sequencing.
Alternatively, chimeric transcripts can be the products of gene fusion
following inter-chromosomal translocations or intra-chromosomal 
rearrangements (Gingeras 2009; Maher et al. 2009b; Herai and 
Yamagishi 2010). Specific cellular phenotypes are characterized by 
expression of chimeric transcripts, for example, the fused BCR/ABL, 
FUS/ERG, MLL/AF6, and MOZ/CBP genes are expressed in acute 
myeloid leukemia (AML) (Panagopoulos et al. 2003; Nambiar et al. 
2008), and the TMPRSS2/ETS chimera is associated with over- 
expression of the oncogene in prostate cancer (Nambiar et al. 
2008). In principle, chimeric transcripts can augment the number 



Corresponding author
E-mail avalencia@cnio.es

Article published online before print. Article, supplemental material, and publication
date are at http://www.genome.org/cgi/doi/10.1101/gr.130062.111. Freely
available online through the Genome Research Open Access option.



of gene products available in a given genome and are suspected to 
function not only in cancer (Thomson et al. 2000; The ENCODE 
Project Consortium 2007; Gingeras 2009) but also in normal cells 
(Akiva et al. 2006; Parra et al. 2006). 

A systematic analysis of the location of the 5' termini of 
coding genes expressed in various cell lines was initiated as part of 
the ENCODE pilot project (Denoeud et al. 2007; The ENCODE 
Project Consortium 2007; Tress et al. 2007; Djebali et al. 2008). 
This project discovered that gene boundaries extend well beyond 
the annotated termini in 65% of cases, often extending into neigh- 
boring genes, leading to the production of chimeric RNAs (Gingeras 
2009). A more recent revision of this analysis focusing on chro- 
mosomes 21 and 22 revealed additional cases of chimeric transcripts 
not only connecting neighboring genes but, rather, encompassing 
distal genes (Djebali et al. 2008, 2012). Characterization of these 
chimeric transcripts has highlighted that the information stored in 
the genome and expressed in the transcriptome is not as linear as 
previously believed (Guigo et al. 2006; Gingeras 2009). 

Although some tissue-specific chimeric transcripts as well as 
inter-chromosomal and intra-chromosomal chimeras have been 
identified by paired-end transcriptome sequencing (Maher et al. 
2009a,b), only a limited number of chimeric transcripts and their 
associated protein products have been characterized to date, the 
majority resulting from chromosomal translocations and associ- 
ated with cancer (Mitani 2004; Miura et al. 2004; Eguchi et al. 
2006; Candel et al. 2009; Maher et al. 2009b; Silberg et al. 2010). 
For instance, gene fusion in chronic myelogenous leukemia (CML)
leads to an mRNA transcript that encompasses the 5' end of the 
BCR gene and the 3' end of the ABL gene. Notably, translation of 
this transcript produces a chimeric BCR-ABL protein that possesses 
increased tyrosine kinase activity (Rabbitts 1994; Nambiar et al. 2008). 

Various studies have used expressed sequence tag (EST) cov- 
erage to search for chimeric transcripts (Akiva et al. 2006; Parra 
et al. 2006); Li et al. (2009c) performed an EST screen in humans,
mice, fruit flies, and budding yeast. Of the 25 chimeric transcript 
candidates identified in fly and five in yeast, 30% have been con- 
firmed by RT-PCR (Li et al. 2009c). An even higher RT-PCR con- 
firmation rate has been reported for human chimeric transcript 
candidates, ranging from 45% (Akiva et al. 2006) to 34% (Parra 
et al. 2006). As mentioned, the availability and function of cognate 
chimeric proteins has been examined in only a few cases. One 
notable example is a chimera in normal human cells generated by 
trans-splicing of the 5' exons of the JAZF1 gene on chromosome
7p15 and the 3' exons of JJAZ1 (SUZ12) on chromosome 17q11 (Li
et al. 2009b). This chimeric RNA is translated in endometrial 
stroma cells and encodes an anti-apoptotic protein (Gingeras 2009; 
Li et al. 2009b). 

The apparently large discrepancy between the number of 
putative chimeric transcripts and chimeric proteins reported to 
date (100:1) could indicate that most chimeric transcripts are not 
translated and perhaps serve to regulate processes at the RNA level. 
However, the discrepancy could reflect the problem that current 
protocols tend to overestimate the true number of chimeric tran- 
scripts. Indeed, most protocols used to identify chimeric transcripts 
rely on a reverse transcription step and the reverse transcriptase is 
known to switch templates, thus creating chimeric artifacts in vitro 
(Houseley and Tollervey 2010). Therefore, it remains unclear what 
proportion of putative chimeric transcripts are genuine, and of these 
how many are translated. 

Here we report screening of 7424 human chimeric transcript 
candidates from GenBank (Benson et al. 2005), which were pre- 
viously collected in the data sets of chimeric RNAs (Li et al. 2009c; 
Kim et al. 2010). We employed functional annotation, high 
throughput RNA sequencing and mass spectrometry experiments. 
In this way, we confirmed the expression of 175 chimeric RNAs 
and we identified 12 novel chimeric proteins in humans. We also 
assessed the tissue specificity of the chimeric RNAs and we com- 
pared the expression of chimeric proteins with that of the parental 
wild-type proteins. Based on our analysis of the chimeric tran- 
scripts, the largest collection identified to date, we define two 
features of chimeric proteins. First, chimeras exploit signal pep- 
tides and transmembrane domains to alter the cellular localization 
of the associated activities. Second, though chimeras themselves 
are tissue-specific transcripts expressed at low levels, chimeras in- 
corporate parental genes that are expressed at a high level. Such 
chimeras could be produced in cancer cells and those associated 



with other diseases, as well as in response to stress in normal cells. 
To illustrate the proposed characteristic features of chimeras, we 
focused on the chimeric proteins validated by RNA-seq at the RNA 
level, as well as those validated by shotgun and targeted mass 
spectrometry at the protein level. 

Results 

Expression of chimeric transcripts in normal cells 

We collected 7424 sequences of candidate human chimeric RNAs 
from GenBank (Benson et al. 2005), previously collected in the 
chimeric RNA data sets (Li et al. 2009c) and the ChimerDB data- 
base (Kim et al. 2010). To determine if these chimeric sequences are 
indeed expressed as transcripts and to assess their level of expres- 
sion, we screened RNA sequencing data sets (Human Body Map 
2.0: see Methods) in 16 tissues (Supplemental Table S1). Briefly, we
identified the junction sites for each chimeric sequence and then 
searched for matching "chimeric reads," which did not map line- 
arly to annotated transcripts or novel exons in the human genome 
(see Methods). To define if a chimeric read validates a junction, we 
required it to map with at least six nucleotides (nt) on each side of 
the junction (and we allowed for a maximum of three mismatches). 
Our screening procedure inherently excludes reads mapping to 
multiple locations in the genome (repetitive regions), as the chi- 
meric reads by definition do not map to any location in the ge- 
nome or to the annotated transcriptome. 
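
The junction-validation rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the pipeline used in the study: the read is assumed to be already placed at a known offset on the chimeric sequence, the alignment is ungapped, and the function and parameter names (`validates_junction`, `min_overhang`, `max_mismatches`) are hypothetical.

```python
def validates_junction(chimera, junction, read, offset,
                       min_overhang=6, max_mismatches=3):
    """Return True if `read`, placed at `offset` on `chimera`, supports the junction.

    `junction` is the index of the first base contributed by the second gene.
    The alignment is ungapped (an assumption made for this sketch).
    """
    end = offset + len(read)
    if offset < 0 or end > len(chimera):
        return False
    # The read must cover at least `min_overhang` nt on each side of the junction.
    left_overhang = junction - offset
    right_overhang = end - junction
    if left_overhang < min_overhang or right_overhang < min_overhang:
        return False
    # Count mismatches between the read and the chimeric sequence.
    mismatches = sum(1 for a, b in zip(read, chimera[offset:end]) if a != b)
    return mismatches <= max_mismatches


# Toy example: 20 nt from "gene A" joined to 20 nt from "gene B".
chimera = "ACGTACGTACGTACGTACGT" + "TTGGCCAATTGGCCAATTGG"
junction = 20
spanning_read = chimera[12:30]   # 8 nt on the left, 10 nt on the right
shallow_read = chimera[17:35]    # only 3 nt on the left
print(validates_junction(chimera, junction, spanning_read, 12))  # True
print(validates_junction(chimera, junction, shallow_read, 17))   # False
```

A read that overlaps the junction by fewer than six nucleotides on either side is rejected even when it matches perfectly, which is the behavior the text describes.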

Among the 7224 ESTs and mRNAs in ChimerDB (Kim et al. 
2010), we found that 333 (4.5%) had at least two matching reads 
from the Human Body Map data set, 212 (3%) had matching reads 
in two tissues, and 156 (2.2%) matched at least two nonidentical 
chimeric reads, i.e., mapped to distinct nucleotide positions in the 
chimera junction site (Table 1; http://chimera.bioinfo.cnio.es/). 
We focused on the cases validated by at least two distinct reads in 
order to rule out synthetic duplicates created during the RNA- 
sequencing protocols. In such cases, the number of reads con- 
firming the junction may be high but they would all align to the 
same position. Demanding at least two distinct mapping positions 
is a useful strategy to avoid this type of bias, and in practice, this 
reduces the number of confirmed chimeras from 333 to 156 (see 
Table 1). Furthermore, half of the remaining 156 cases are validated 
by two to 12 reads, while the other half are validated by 12 to 2694 
reads. Since the chimeric ESTs were primarily identified in cancer 
cells, it is noteworthy that some are expressed also in normal tis- 
sues (Supplemental Material). These findings corroborate those of 
other studies showing that some fusion transcripts originate from 
normal tissues (Akiva et al. 2006; Parra et al. 2006). Of note, as 
a negative control we used the data set of 300 fusion proteins 
found in cancers, generated by translocations and listed in the 
dbCRID database (Kong et al. 2011). Remarkably, we did not find 



Table 1. The expression of chimeric transcripts was confirmed using paired-end RNA-seq reads from various tissues 
any reads in normal tissues of the Human Body Map 2.0 that
matched the junction sites of these cancer-associated chimeras. 
This latter finding confirms that chimeras generated by these 
chromosomal translocations are not expressed in the considered 
normal tissues, or at least not at a detectable level. 
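
The requirement of at least two distinct mapping positions can be illustrated as follows. This is a minimal sketch, assuming each junction-spanning read is summarized by its start offset on the chimeric sequence; the helper names are hypothetical and do not come from the study.

```python
def count_distinct_positions(read_offsets):
    """Number of distinct start positions among junction-spanning reads."""
    return len(set(read_offsets))


def is_confirmed(read_offsets, min_distinct=2):
    """A junction counts as confirmed only if its reads start at
    at least `min_distinct` different positions."""
    return count_distinct_positions(read_offsets) >= min_distinct


# Many reads stacked on a single position look like synthetic duplicates.
print(is_confirmed([41, 41, 41, 41, 41]))  # False
# Fewer reads, but at two different positions, pass the filter.
print(is_confirmed([41, 41, 47]))          # True
```

The filter deliberately ignores the raw read count, since a large number of identical reads is exactly what a duplication artifact would produce.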

To estimate the expression level of chimeras, we used the 
measure introduced by Mortazavi and colleagues in 2008 (Mortazavi 
et al. 2008), namely RPKM (Reads Per Kilobase per Million mapped 
reads; see Methods), which takes into account the depth of se- 
quencing and the length of the considered "junction" region. 
Calculations for the human chimeras were performed with the 
total number of reads set at 1097 million and the "junction" size of 
138 nt (= 2 × [75 nt (read size) − 6 nt]). The most weakly
expressed, yet detectable, chimera had two matching reads, cor- 
responding to 0.013 RPKM (Table 1). In general, we observed that 
most chimeras are lowly expressed transcripts (Fig. 1). Noticeably, 
most of the parental genes participating in the formation of 
chimeras (whether expressed or not) are moderately to highly 
expressed, with expression ranging from 0 to 2495 RPKM and 
a median expression level of 12.6 (Fig. 1). Though some genes, like 
tumor necrosis factor (TNF), are not expressed at all in normal 
tissues, the expression levels of most parental genes fall into the 
third quartile of the gene expression distribution. Moreover, our 
data show that genes participating in translated chimeras, i.e., chimeras
for which we have evidence of translation (as explained below),
are even more highly expressed (Wilcoxon test, P-value < 5 × 10⁻⁶).
In light of these observations, we concluded that, in general, the 
formation of chimeras is associated with parental genes that ex- 
hibit high expression levels. Furthermore, there is an association 
between detectable expression of the chimera at the protein level 
and the level of expression of its parental genes (Fig. 1; Table 2; 
Supplemental Material). 
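
As a worked check of the RPKM figures quoted above, the short Python sketch below recomputes the junction length and the expression of the weakest detectable chimera. The function name `rpkm` and the constant names are illustrative; the numeric inputs (75-nt reads, a 6-nt minimum overhang, 1097 million reads, two matching reads) are taken from the text.

```python
def rpkm(read_count, region_length_nt, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads."""
    kilobases = region_length_nt / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


READ_LENGTH = 75      # nt, read size quoted in the text
MIN_OVERHANG = 6      # nt required on each side of the junction
junction_length = 2 * (READ_LENGTH - MIN_OVERHANG)
total_reads = 1_097_000_000

weakest = rpkm(2, junction_length, total_reads)
print(junction_length, round(weakest, 3))  # 138 0.013
```

Two reads over a 138-nt junction in 1097 million reads therefore correspond to roughly 0.013 RPKM, in agreement with the value reported for the most weakly expressed chimera.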

Chimeras are lowly expressed transcripts 

To study the expression levels of chimeric transcripts relative to 
other human transcripts we produced density plots of all transcript 
expression levels as described recently (Hebenstreit et al. 2011). We
found that the distribution of the expression of all genes is clearly 
bimodal (Fig. 2). The interpretation is that the first peak corre- 
sponds to lowly expressed and putatively nonfunctional mRNAs, 



while the second peak encompasses highly expressed mRNAs 
(Hebenstreit et al. 2011). Our chimeric transcripts clearly fall 
within the first peak, with the exception of two chimeras that fall 
within the second peak. The first one (ESTid = "CD109591.1")
contains regions from genes RAB21 and RN45S, where gene RAB21 
is on chromosome 22 and RN45S corresponds to an rRNA gene
located within an unplaced contig. We suspect that this chimera 
results from read-through transcription of the two genes; and the 
unplaced contig, or another copy of the rRNA is actually on 
chromosome 22, next to gene RAB21 . The second highly expressed 
chimera (ESTid = "AB042558.1") comprises exons from PDE4DIP 
and NBPF11, the two genes located on chromosome 1. However, 
this chimera cannot be due to read-through transcription as the 
two genes are located on different strands and separated by >100 
kb. Noticeably, PDE4DIP overlaps NBPF9, raising the possibility 
that some kind of recombination event has occurred involving 
NBPF9 and NBPF11 genes either at the genomic or transcriptome
level, favoring the formation of this chimera. 

Taken together, these observations indicate that most chi- 
meras are expressed at one or two copies per cell on average, and 
notably this expression involves highly expressed genes. Although 
these findings are compatible with a potentially unregulated pro- 
duction of chimeric transcripts, we show below that some chi- 
meras likely exert biological roles as they are expressed at the 
protein level. 

Tissue specificity of chimeric transcripts 

We evaluated the tissue specificity of chimeras relative to all other 
genes. Tissue specificity was measured using Shannon entropy (see 
Methods). Low entropy values correspond to high tissue specificities 
(Fig. 3). We found that, overall, chimeras were more tissue specific 
than other genes (Wilcoxon test, P-value < 2 × 10⁻¹⁶). However, we
noticed that in general tissue specificity correlated with expression 
level, such that highly expressed genes tend to be broadly 
expressed, and thus acknowledged that the low expression level of 
most chimeras could be a confounding factor. We therefore per- 
formed a new test, now controlling for the expression level as 
a potential confounding factor and we still found that chimeras 
were more tissue specific than other genes (ANCOVA, P-value < 
7.7 × 10⁻¹³). We conclude that, irrespective of expression level,



[Figure 1 plot: "Expression of Genes and Chimeras in Human Tissues"; box labels include "Genes involved in Chimeras" and "Genes involved in Translated Chimeras".]


Figure 1. Expression levels of genes and chimeric transcripts in humans. The expression of genes in human tissues ranges from 0.001 to 15,700 RPKM,
with a median of 0.588, whereas the expression of chimeras ranges from 0.006 to 17.8 RPKM with a median of 0.02. Most of the genes involved in the
formation of chimeras are moderately to highly expressed, as their expression ranges from 0 to 2495 RPKM, with a median of 12.6. The trend is also
observed for the translated chimeras and their parental genes (Wilcoxon test, P-value < 5 × 10⁻⁶). The whiskers of the boxplot extend to the data extremes
(see also Supplemental Fig. S4).
[Figure 2 plot title: "RPKM distributions of Genes and Chimeras"]

Figure 2. A density plot of RPKM expression levels for all genes versus
chimeras. The total number of chimeras is much lower than the total 
number of genes. Hence, the densities of the distributions are plotted and 
not the counts. The height of the bars does not correspond to number of 
transcripts, but to the proportion of transcripts in a given expression 
category. The distribution for all genes is bimodal, with chimeras falling in
the peak of lowly expressed genes.



chimeras are significantly more tissue specific than non-chimeric 
transcripts (Fig. 3). 
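
To make the entropy-based measure concrete, the sketch below computes the Shannon entropy of an expression profile across tissues. The exact normalization used in the study is given in its Methods, which are not part of this section, so the details here are assumptions: expression values are converted to proportions across tissues and the entropy is expressed in bits. The function name is hypothetical.

```python
import math


def tissue_entropy(expression):
    """Shannon entropy (in bits) of an expression profile across tissues.

    Low values indicate tissue-specific expression; the maximum,
    log2(number of tissues), indicates uniform expression.
    """
    total = sum(expression)
    if total <= 0:
        raise ValueError("profile has no expression")
    entropy = 0.0
    for value in expression:
        if value > 0:
            p = value / total          # proportion of expression in this tissue
            entropy -= p * math.log2(p)
    return entropy


n_tissues = 16
broad = [5.0] * n_tissues                  # same level in all 16 tissues
specific = [0.0] * (n_tissues - 1) + [5.0]  # expressed in a single tissue
print(tissue_entropy(broad))     # 4.0, i.e. log2(16), the maximum
print(tissue_entropy(specific))  # 0.0, the minimum
```

With 16 tissues the entropy ranges from 0 (expression confined to one tissue) to 4 bits (uniform expression), which matches the statement that low entropy corresponds to high tissue specificity.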

Chimeric transcripts are detected at the protein level 

To determine if chimeras are expressed at the protein level and rule 
out the possibility that they are artifacts of reverse transcription, 
we used both computational and experimental approaches. First, 
we performed a comprehensive search for unique peptide-spectrum
matches in mass spectrometry databases using the chimeric 
sequences translated in six frames. We considered only unique 
peptides spanning the gene-gene junctions of the chimeras (three 
amino acids at each side of the junction) with a maximum false 
discovery rate (FDR) of 1% (see Methods). Second, we conducted 
in-house experiments to detect chimeric proteins using both 
shotgun proteomics and targeted analysis of the identified unique 
peptides spanning the chimeric junctions. Thus, using these ap- 
proaches we identified 16 unique peptides that span the junction 
sites of human chimeras (FDR < 1%) (Methods; Supplemental Fig. S1;
Supplemental Material), confirming translation of the 12 cognate 
chimeric transcripts (Table 2). Notably, chimeric reads spanning 
the junction sites of three of these chimeras were identified in 
different tissues of the Human Body Map (the chimera CN306050.1,
with only one read; BG978110.1, 10 reads; and BM838228.1, three 
reads). Finally, we confirmed two putative chimeric proteins by
targeted mass spectrometry analysis, termed selective reaction
monitoring (SRM), using specifically synthesized heavy-labeled
standards (Supplemental Figs. S2, S3).
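
The computational step described above (six-frame translation, then keeping only peptides with at least three residues on each side of the junction) can be sketched as follows. This is an illustration only: it uses the standard genetic code, assumes the junction falls on a codon boundary so that it can be expressed as a residue index, and uses hypothetical function names. It does not reproduce the database search or FDR estimation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Yield (strand, frame, protein) for all six reading frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    for strand, s in (("+", seq), ("-", reverse)):
        for frame in range(3):
            yield strand, frame, translate(s[frame:])


def spans_junction(pep_start, pep_len, junction_aa, min_flank=3):
    """True if a peptide covers >= min_flank residues on both sides.

    `junction_aa` is the index of the first residue encoded downstream
    of the junction (a simplification made for this sketch).
    """
    left = junction_aa - pep_start
    right = pep_start + pep_len - junction_aa
    return left >= min_flank and right >= min_flank


frames = list(six_frames("ATGGCTAAAGGTCGTCTGTAA"))
print(frames[0])                 # ('+', 0, 'MAKGRL*')
print(spans_junction(2, 6, 5))   # True: 3 residues on each side
print(spans_junction(4, 6, 5))   # False: only 1 residue on the left
```

In practice the junction may split a codon, in which case the residue encoded across the junction would need separate handling; that case is omitted here.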

Remarkably, one chimera identified initially in the EST col- 
lection of ChimerDB (Kim et al. 2010), ESTid = "BM838228.1," was 
evident in 18 different mass spectrometry experiments in PeptideAtlas 
(FDR < 1%, placental tissue and embryonic stem cells) (Supple- 



mental Material). This is a chimera of the ribosomal RPL13A 
protein and actin ACTG1 for which two unique overlapping 
peptides that match the chimera junction site were identified 
(LWTVSRCLTASHTVPIYEGYALPHAILR, E-value < 5.1 × 10⁻⁵;
ASHTVPIYEGYALPHAILR, E-value < 1.3 × 10⁻⁵) (Table 2). Specifically,
the former peptide had 10 supporting peptide-spectrum
matches in the PeptideAtlas experiments and it contained an 
overlap of 12 residues spanning the junction site. Targeted mass 
spectrometry (SRM) was employed to validate and measure the 
levels of this human chimeric protein using unique peptide 
matching at its gene-gene junction site (ASHTVPIYEGYALPHAILR) 
(Supplemental Fig. S2). In addition, we detected three RNA-seq 
reads in two different normal tissues (ovary and adipose, Human 
Body Map) (Table 2). The unusually large number of mass spec- 
trometry experiments, in which matching peptides were identi- 
fied, probably reflects that this chimeric protein is more abundant 
than other such proteins. Interestingly, we were able to verify this 
chimera by RT-PCR using TaqMan probes in different RNA samples 
(Supplemental Material). Notably, this chimera incorporates two
highly expressed cytosolic proteins (RPL13A, RPKM = 343.5;
ACTG1, RPKM = 851.6). In particular, Phyre2 structural prediction
analysis (Kelley and Sternberg 2009) of this chimera suggests it 
can fold into a 3D structure with 100% confidence and 85% 
identity to the Ribonuclease H-like motif fold (Actin-like ATPase 
domain) (Fig. 4A). Accordingly, we identified a preserved beta 
strand that appears close to the junction site of the chimera, which 
corresponds with high confidence (RMSD < 2.1) to the secondary 
structure of wild-type actin (Fig. 4B). However, an ATP-binding site 
is missing in the chimeric sequence (Fig. 4B), perhaps indicating an 
inability to produce the polymerized form of F-actin. 

For the chimera ESTid = "CN310211.1," we found a unique 
peptide (GRLGQPAMAK, FDR < 1%) (Table 2; Supplemental Ma- 
terial) spanning its gene-gene junction site using the shotgun 
mass spectrometry analysis of the total proteome of the human cell 
lines (see Methods). This chimera incorporates full domains: coiled 
coil domain from COPS7B (RPKM = 3.4) and RAB domain from 
RAB13 protein (RPKM = 35.4). The unique peptide was confirmed 
by the targeted mass spectrometry (SRM) analysis using a synthe- 
sized standard peptide (Supplemental Fig. S3). 



[Figure 3 plot title: "Tissue Specificity of Genes and Chimeras for Different Expression Levels"]

Figure 3. Tissue specificity of all genes versus chimeras. All genes are
presented in red and chimeras in blue. The expression of chimeras is more
tissue specific across the different expression levels (ANCOVA, P-value
< 7.7 × 10⁻¹³). The bins are chosen so as to cover the expression range of
all chimeras and have an equal number of chimeras per bin.
Figure 4. A chimera with confirmed RNA and protein expression. We
detected two overlapping unique peptides that matched the junction site 
in 18 mass spectrometry experiments and by the targeted mass spec- 
trometry (SRM) analysis, confirming that this transcript (ESTid = 
"BM838228.1") from ChimerDB (Kim et al. 2010) is expressed at the 
protein level. (A) The 3D structure of the chimeric protein is modeled by 
Phyre2 (Kelley and Sternberg 2009). (Green) The chimeric protein part 
derived from actin, ACTG1, predicted using homology modeling; (red)
the part of the ribosomal protein, RPL13A, predicted using ab initio
methods. The structure is modeled using the Ribonuclease H-like motif
fold (actin-like ATPase domain) with 100% confidence and 85% identity.
(B) The secondary structure modeling by Phyre2 (Kelley and Sternberg
2009) predicts that a highly preserved beta strand appearing in the wild- 
type actin protein should also feature in the chimera (blue rectangle). The 
motif "CDGV" (red rectangle) is the ATP-binding site, which is missing in 
the chimera sequence. 



We performed a second round of shotgun proteomic anal- 
yses, identifying eight of the 11 chimeras found in the first 
round: BE837730.1, BM827779.1, BM842093.1, BY796539.2,
CN310211.1, CN430188.1, DB154094.1, and DW419036.1 (see 
Supplemental Material). Moreover, we verified the unique peptide 
(VISSIEQKTMAAPSVK) of the ESTid = "BF969911.1" chimera using
targeted proteomics and fractionation analysis (Fig. 5). Based on 
these proteomics analyses, we surmise that >70% of the peptides 
representing chimeric junctions can be verified in multiple rounds 
of proteomic analysis. In summary, we provide the first unbiased 
genome-wide evidence that chimeras are indeed expressed at both 
the transcriptional and protein levels in humans. These chimeric 
proteins are less abundant than regular proteins and they seem to 
be highly tissue specific. 

Chimeras may alter cellular localizations of proteins 
Chimeras incorporate signal peptides 

The 1999 Nobel Prize in Physiology or Medicine was awarded to 
Günter Blobel for the discovery that proteins have intrinsic signals



that govern their transport and localization within the cell 
(Emanuelsson et al. 2007; Clerico et al. 2008). Indeed, these signal 
sequences were shown to serve as zipcodes, specifying the eventual 
destination of the proteins. Signal peptides targeting proteins 
to the endoplasmic reticulum (ER) membrane in eukaryotes are 
15-30 amino acids long, self-contained and removed after target- 
ing (Emanuelsson et al. 2007; Clerico et al. 2008). In eukaryotes, 
proteins translocated across the ER membrane are by default 
transported through the Golgi apparatus and then exported by 
secretory vesicles (Emanuelsson et al. 2007; Clerico et al. 2008). 
Some chimeras incorporate signal peptides that could direct pro- 
teins to the ER and Golgi apparatus. To be functional, these signal 
peptides must be present in the gene that forms the 5' end of the
transcript, and thereby also transport the product of the gene at
the 3' end of the chimeric transcript.

For example, we found a human chimeric transcript (ESTid =
"AJ420584.1") (Supplemental Material) in the ChimerDB data set
(Kim et al. 2010) that comprises Acetyl-CoA:lyso-PAF acetyltransferase
(LPCAT2) and the thioredoxin domain containing
protein 5 (TXNDC5) and which is translated starting from the
signal peptide of TXNDC5 (localized in the ER). Moreover, this
chimera incorporates two transmembrane domains of LPCAT2 
(Fig. 6A). The signal peptide is predicted to redistribute the chimeric 
protein from the plasma membrane to the ER lumen (Fig. 6B). 

Notably, of the 7224 chimeric transcripts from ChimerDB 
(Kim et al. 2010), 32% incorporate signal peptides, and of the 175 
chimeras confirmed by more than two RNA-seq reads, 29% have 
these signal peptides (Table 3; Supplemental Material). Additionally,
for the data set of Li et al. (2009c), we observed that 34% of the
chimeras incorporate signal peptides (Table 3). Given the expected
percentage for all human genes (22%) (Table 3), we concluded that
signal peptides are significantly enriched in chimeras (Table 3,
P-value < 0.001).
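
The comparison of an observed proportion with an expected one can be sketched as a two-category chi-square goodness-of-fit test, the test named in the footnote to Table 3. The code below is a hedged illustration, not the authors' script: it applies the test to the ChimerDB counts quoted above (2339 of 7224 chimeras with a signal peptide, against an expected 22%), and the function name is hypothetical. For one degree of freedom the upper-tail probability is computed exactly from the complementary error function.

```python
import math


def chi2_gof_two_categories(observed_with, total, expected_fraction):
    """Chi-square goodness-of-fit (1 d.f.) for 'with' versus 'without' a feature."""
    expected_with = total * expected_fraction
    expected_without = total - expected_with
    observed_without = total - observed_with
    stat = ((observed_with - expected_with) ** 2 / expected_with
            + (observed_without - expected_without) ** 2 / expected_without)
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# ChimerDB: 2339 of 7224 chimeras carry a signal peptide; 22% expected.
stat, p = chi2_gof_two_categories(2339, 7224, 0.22)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
```

Under these assumptions the statistic is far above the 1-d.f. critical value for P = 0.001 (about 10.8), consistent with the enrichment reported for this data set.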

Chimeras are enriched in transmembrane domains 

Transmembrane proteins carry out many key functions in cell 
signaling and transport (Deutsch et al. 2008; Deutsch 2010). Like 
signal peptides, transmembrane (TM) segments determine the lo- 
calization of the proteins in cell membranes. Thus, we anticipated 
that these segments in chimeric proteins lead to the membrane 
association of cytosolic proteins, thereby altering their molecular 
interactions and cellular functions. A chimeric protein has been 
identified that encompasses parts of the matrilin (MATN) and ly- 
sosomal-associated protein transmembrane (LAPTM) genes (Maeda 
et al. 2005). In accord with our hypothesis, the expression and 
subcellular localization of the MATN-LAPTM chimera differ from 
those of the parental wild-type genes participating in the chimera 
(Maeda et al. 2005). Similarly, the TWE-PRIL chimeric protein that 
comprises two tumor necrosis factors, TWEAK and APRIL, contains 
the TWEAK cytoplasmic and TM domains combined with the 
APRIL C-terminal domain (Pradet-Balade et al. 2002). Accordingly, 
TWE-PRIL was shown to be a membrane protein, positioning the 
APRIL receptor-binding domain at the cell surface (Pradet-Balade 
et al. 2002). 

For ChimerDB (Kim et al. 2010), we found that 51% (3701/7224)
of chimeric transcripts have predicted TM domains (Table 3). 
Likewise, 50% (88/175) of chimeras confirmed by more than two 
RNA-seq reads were found to incorporate at least one TM domain 
(Table 3). In addition, for the data set of Li et al. (2009c), 55% (110/200)
of chimeras integrate TM domains (Table 3). To assess the signifi- 
cance of these proportions we used the GENCODE data set of 
22,304 human protein coding sequences (Table 3; Harrow et al. 
Figure 5. Selective reaction monitoring (SRM) mass spectrometry analysis. The peptide VISSIEQKTMAAPSVK at the junction site of the BF969911.1
chimera was confirmed by SRM analysis using a stable isotope labeled standard. Briefly, a peptide of the same amino acid sequence was synthesized with
a heavy lysine residue, which was then spiked into the digested human prostate cancer lysate. The mixture was fractionated by high pH reversed phase
liquid chromatography and the fractions analyzed by SRM mass spectrometry. On the basis of the concentration of the labeled standard, the chimera was
estimated to be present at a concentration of ~30 fmol/mL. A signal-to-noise ratio was calculated as root-mean-square (RMS).



2006). We looked for predicted TM domains and found 23% of the 
human proteins in this data set contain at least one TM domain 
(Table 3). We used this finding to calculate the expected number of 
chimeras containing at least one TM domain, taking into account 
the fact that chimeras are generated from two genes but assuming 
an upper boundary for the appearance of TM helices as chimeras 
are rarely generated from two whole proteins. Given these assumptions,
the expected percentage of chimeras incorporating
one or more TM domains is 40.2%, and we find that TM domains are
significantly enriched in putative chimeras (Table 3, P-value < 0.001).
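
One way to see where an expectation of roughly 40% can come from is the following sketch. It assumes, as a simplification that is not spelled out in this section, that the two parental genes are drawn independently and that either whole parent could contribute a TM domain, which gives an upper bound of 1 − (1 − p)². The function name is hypothetical, and the exact procedure behind the reported 40.2% may differ in detail.

```python
def expected_fraction_two_parents(p_single):
    """Upper bound on the fraction of two-gene chimeras carrying a feature,
    assuming independent parents that each carry it with probability p_single."""
    return 1 - (1 - p_single) ** 2


# Fraction of human proteins with at least one TM domain (Table 3).
p_tm = 5079 / 22304
upper = expected_fraction_two_parents(p_tm)
print(f"single gene: {p_tm:.3f}, two-gene upper bound: {upper:.3f}")
```

Under this independence assumption the bound comes out at about 0.40, close to the 40.2% used as the expected value, and well below the 50% to 55% observed in the chimeric data sets.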

Taken together our observations indicate that chimeric tran- 
scripts could at least partially explain the origins of proteins with 
unexpected cellular localizations. Such proteins are frequently 
evidenced in high throughput protein studies, for example, in the 
Dynamic Proteomics study, which aims to monitor the position
and amount of endogenous proteins in individual human cells 
(Cohen et al. 2008; Frenkel-Morgenstern et al. 2010). 

Discussion 

A number of molecular processes, including trans-splicing, translocations
or other chromosomal rearrangements, as well as various
aberrations of standard co-linear transcription, can produce chi- 
meric transcripts incorporating information from distinct geno- 
mic regions. Here, we describe a new comprehensive approach to 
validating expression of chimeras at the protein level that involves 
identification of peptides spanning the junction sites of chimeras. 
Taking this approach along with shotgun proteomics and targeted 
mass spectrometry analysis, we establish that some chimeras are 
indeed translated and detectable. It should be noted that, in typical 
shotgun proteomic experiments, the standard protein sequence 



databases, such as the UniProtKB (Magrane and UniProt Consortium 
2011), do not contain chimeric proteins. Thus chimeric proteins 
are not taken into account in most proteomic studies. Given the 
rapid advancement of mass spectrometry instruments and ever 
increasing sensitivity it seems likely that more and more chimeric
proteins will be discovered.



[Figure 6A panel labels: "TXNDC5 (part) with signal peptide"; "Transmembrane protein LPCAT2 (part)"]
Figure 6. Putative chimeric proteins often contain the signal peptides
or TM domains of the parental proteins. (A) Schematic view of the two
proteins participating in the human chimera: thioredoxin domain containing
protein 5 (TXNDC5) and lysophosphatidylcholine acyltransferase
2 (LPCAT2). (B) Schematic view of the hypothetical chimera comprising
the signal peptide of TXNDC5 and two TM domains of LPCAT2. We
predict that this chimera is localized in the ER lumen.
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Table 3. Signal peptides and TM domain frequencies in chimeras 
and all human genes 



Chimeric data set 


Total 
genes 


% Signal 
peptides (N) 


% TM 
domain(s) (N) 


All ESTs a 


7224 


32 (2339) 


51 (3701) 


200 chimeric ESTs b 


200 


34 (68) 


55 (1 1 0) 


All chimeras confirmed 


175 


29 (51) 


50 (88) 


by RNA-seq reads c 








All human genes d 


22,304 


22 (4838) 


23 (5079) 


P-value e 




<0.001 


<0.001 



a AII ESTs and mRNAs from the ChimerDB collection (Kim et al. 2010). 
b Two-hundred transcripts of the human data set (Li et al. 2009c). 
C AII chimeric transcripts confirmed by RNA-seq from all three aforemen- 
tioned data sets. 

d AII human genes from CENCODE (Harrow et al. 2006). 
e P-values were computed by x 2 goodness-of-fit test, comparing the ob- 
served percentage of chimeric proteins with TM domains or signal pep- 
tides for each data set to the expected percentage of TM domains or signal 
peptides for all human genes (the expected percentage of chimeras with 
TM domains is 40.2%). 

Before our study, various chimeric transcripts had been 
detected in diverse species by RNA sequencing and verified ex- 
perimentally, such as the 12 gene fusions in humans (Maher et al. 
2009a,b) and the multiple chimeric transcripts identified by the 
ENCODE pilot project in assorted cell types (Denoeud et al. 2007; 
The ENCODE Project Consortium 2007; Tress et al. 2007; Djebali 
et al. 2008). Furthermore, early EST assembly experiments sug- 
gested the presence of chimeric transcripts in budding yeast, fruit 
flies, mice, and humans and estimated that up to 25%-49% of all 
genes could participate in the formation of chimeric transcripts 
(Akiva et al. 2006; Parra et al. 2006; Li et al. 2009c). This not- 
withstanding, the present study is the first to systematically survey 
public databases for chimeras and validate expression at both the 
transcriptional and protein levels using an unbiased genome-wide 
approach. 

We suspect that the generation of chimeric transcripts and 
subsequent translation into chimeric proteins serve to create novel 
proteins with substantially altered functions compared with the 
constitutive and alternative isoforms. The altered functions in- 
clude modified localization (Thomson et al. 2000; Pradet-Balade 
et al. 2002) and tissue specificity (Akiva et al. 2006; Parra et al. 
2006), and could be linked to specific conditions or diseases, such
as cancer (Edgren et al. 2011; Kannan et al. 2011). We also provide
support for our premise that chimeras are enriched in signal peptides
and transmembrane domains, which alter the cellular localization
of proteins participating in the chimeras. Notably, this
hypothesis accords with the tissue specificity of chimeras, as Schug 
et al. (2005) have proposed that most tissue-specific proteins are 
extracellular and mid-tissue specific proteins are membrane proteins.
In addition, we provide evidence that chimeras connecting distal
genes, more so than chimeras connecting neighboring genes, tend to
incorporate highly expressed genes. This latter observation accords with the "RNA
polymerase-induced" mechanism for the chimera production 
elaborated in Gingeras (2009). Notably, the trend is even more 
stringent for the translated chimeras, because the genes involved 
in translated chimeras are even more highly expressed than genes 
involved in chimeras that are not translated. In light of our data,
we hypothesize that the protein products of trans-splicing or genomic
alterations generated during evolution serve to control the
activity of parental proteins during certain cellular processes or in
response to stress. Understanding the principles of functional
chimera design and production is an urgent goal in modern proteomics
and genomics.

To conclude, we establish here that most chimeric RNAs are 
tissue specific and weakly expressed but can be detected by RNA 
sequencing techniques. Since the chimeric transcripts analyzed 
were primarily derived from cancer cells, it is intriguing that 
matching chimeric reads were found in the Human Body Map data 
sets for tissues of healthy individuals. Our observations coincide 
with other recent studies on chimeric transcripts detected in can- 
cers as well as in normal cells (Akiva et al. 2006; Parra et al. 2006; Li 
et al. 2009b). Having validated the existence of chimeric proteins 
in eukaryotes, we caution that chimeric proteins should be con- 
sidered when designing future experimental studies of protein 
localization in both normal and cancer cells. Finally, we suggest 
that chimeras should be taken into account when protein-protein 
interactions are studied, and especially when developing therapeutics. 

Methods 

Data sets and annotation 

To investigate the potential functional role of chimeras, all 7424
publicly reported human chimeric RNA transcripts were analyzed.
Specifically, we screened the chimeric ESTs found in human
cells by Li et al. (2009c) (200 transcripts), together with all the 
chimeric ESTs and mRNAs (7224 transcripts) in ChimerDB (Kim 
et al. 2010). All these chimeric RNAs have well-defined junction 
sites (at least three nucleotides on either side of the junction). 
However, only a few of the chimeric sequences exhibit canonical 
splice-junction sites (Li et al. 2009c). 

Initially, sequence similarities between the chimeric RNA 
transcripts of Li et al. (2009c) and human genomic regions were 
identified using in-house software and the UCSC BLAT search 
(Kent 2002; Rosenbloom et al. 2012) to annotate the genes par- 
ticipating in each chimera. NCBI BLAST (Altschul et al. 1997) was 
applied to delineate the wild-type protein domains corresponding 
to the genomic regions contained within each chimeric mRNA. All 
the domain annotation results were manually inspected. WU 
BLAST (Lopez et al. 2003) was employed to define more precisely 
short or "strange" genomic regions, as WU BLAST has proven most 
efficient when transcript composition is unknown (Elizabeth Cha 
and Rouchka 2005). Finally, FASTA (Pearson and Lipman 1988) 
was used to find the 100% identical sequence matches for peptides 
identified by the experimental mass spectrometry analyses for the 
gene-gene junction of chimeras. 

Confirmation of chimeric transcripts by RNA-seq 

To assess if a chimeric transcript is present in some RNA sample, we 
aligned (mapped) the RNA-seq reads to the sequence of each chi- 
mera and its junction sites. To ensure that the read could be un- 
ambiguously assigned to the chimera, and not to another location 
in the genome, we performed the following mapping protocol. 
First, we mapped the RNA-seq reads to the reference genome, in 
order to identify which reads can be assigned to exons, i.e., exonic 
reads. In order to identify junction reads, we constructed, for each 
gene, the combination of all pairs of exons, yielding a collection of 
all possible intra-gene exon junctions and mapped the RNA-seq 
reads to these junctions. We then selected the reads that were not
mapped in the previous stages (i.e., reads not mapping to annotated
transcripts or novel exons) and mapped them to the chimeric
transcripts. Finally, we selected only the reads that mapped precisely
to the junction of the chimera, with a minimum of six nucleotides
(two codons) mapping on each side of the junction. The
number of reads mapping to the junction was taken as an indication
of the abundance of the chimera in the RNA sample. Our
mapping protocol can be considered stringent as it ensures that, if 
a read maps both to a known transcript and to a chimeric tran- 
script, it will be assigned to the known transcript. This procedure 
naturally excludes reads mapping to multiple locations on the 
genome (repetitive regions), since chimeric reads correspond to 
reads not mapping to any location on the genome, or the anno- 
tated transcriptome. All the mappings were performed using GEM 
(http://sourceforge.net/apps/mediawiki/gemlibrary) allowing for 
a maximum of three mismatches (Djebali et al. 2008). 
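
The final selection step of the protocol can be sketched as follows. This is a minimal illustration, not the GEM-based pipeline itself: it assumes that each alignment to a chimera is already available as a hypothetical (start, length, mismatches) tuple in 0-based coordinates, and it only applies the two criteria stated above (at least six nucleotides on each side of the junction and at most three mismatches).

```python
def spans_junction(read_start, read_length, junction_pos, min_overhang=6):
    """Return True if a read covers at least `min_overhang` nt on each side.

    read_start   : 0-based position of the first read base on the chimera
    junction_pos : 0-based index of the first base downstream of the junction
    """
    read_end = read_start + read_length   # exclusive end coordinate
    left = junction_pos - read_start      # read bases upstream of the junction
    right = read_end - junction_pos       # read bases downstream of the junction
    return left >= min_overhang and right >= min_overhang


def count_junction_reads(alignments, junction_pos, min_overhang=6, max_mismatches=3):
    """Count (start, length, mismatches) alignments that support the junction."""
    return sum(
        1
        for start, length, mismatches in alignments
        if mismatches <= max_mismatches
        and spans_junction(start, length, junction_pos, min_overhang)
    )


# Hypothetical 75-nt reads aligned to a chimera whose junction lies at position 175.
alignments = [(110, 75, 0), (105, 75, 1), (169, 75, 2), (130, 75, 4)]
supporting = count_junction_reads(alignments, junction_pos=175)
print(supporting)
```

In this toy input, the second read has only five nucleotides downstream of the junction and the fourth has four mismatches, so neither is counted.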

The RNA-seq data sets used for the mapping protocol were the 
Human Body Map 2.0 data generated on the HiSeq 2000 by Illu- 
mina in 2010. The data set comprises 1097 million (M) paired-end 
reads of 75 nt, resulting from sequencing RNA taken from 16 dif- 
ferent tissues. 

Quantifying chimeric transcripts with RNA-seq data 

Since chimeric transcripts are combinations of annotated tran- 
scripts, their identification and quantification is challenging. To 
quantify a chimeric transcript, we only considered reads unam- 
biguously mapping to its junction. However, this number neces- 
sarily depended on the depth of sequencing and on the length of 
the considered region (in this case the junction). Therefore, we 
adopted the measure introduced by Mortazavi et al. (2008), namely 
RPKM. RPKM is defined by the formula (Mortazavi et al. 2008): 

$$\mathrm{RPKM} = \frac{\text{total\_reads\_identified\_junction}}{\text{mapped\_reads(millions)} \cdot \text{reads\_length\_junction(KB)}} \qquad [1]$$

where total_reads_identified_junction is the number of reads that have
been mapped to a chimeric junction, reads_length_junction(KB) is
the size of the region considered to cover the junction, and
mapped_reads(millions) is the overall number of mapped reads in millions of
reads.

In our case, the size of the considered region, reads_length_junction,
denoted J, is not the sum of exon lengths (usually used), but
simply the size of the junction calculated as follows:

$$J = 2 \cdot (L - M) \qquad [2]$$

where L is the read size and M the minimum number of nucleotides 
required on each side of the junction to assign the read to the 
junction. 

Thus, for the human RNA-seq data set, L = 75 and M = 6, and
therefore J = 138 nt, or 0.138 kb. The total number of reads sequenced
is 1097 M. Hence, for example, chimera BP305895.1,
which has 291 reads mapping to its junction, has an expression 
level of 291/1097/0.138 = 1.92 RPKM. 
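
The calculation can be reproduced with a few lines of code. The sketch below implements the RPKM definition and the junction size J = 2(L - M) as given above; the function and parameter names are illustrative choices.

```python
def junction_length_kb(read_length, min_overhang):
    """Junction size J = 2 * (L - M), converted from nucleotides to kilobases."""
    return 2 * (read_length - min_overhang) / 1000.0


def junction_rpkm(junction_reads, mapped_reads_millions, read_length, min_overhang):
    """RPKM of a chimeric junction: reads / (millions of reads * junction size in kb)."""
    j_kb = junction_length_kb(read_length, min_overhang)
    return junction_reads / (mapped_reads_millions * j_kb)


# Chimera BP305895.1: 291 junction reads, 1097 M reads, L = 75, M = 6.
rpkm = junction_rpkm(junction_reads=291, mapped_reads_millions=1097,
                     read_length=75, min_overhang=6)
print(round(rpkm, 2))  # 1.92
```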

The correspondence between RPKM and the number of 
transcripts per cell is still not clearly established. Mortazavi et al. 
(2008) consider that 3 RPKM corresponds to approximately one 
transcript per cell in mouse liver, whereas Klisch et al. (2011) 
suggest that 1 RPKM corresponds to between 0.3 and 1 transcripts 
per cell. 

Tissue specificity of chimeras 

The tissue specificity of any transcript was measured using Shannon
entropy (Schug et al. 2005). The expression level of a gene in each
tissue and the resulting entropy were calculated as described by
Schug et al. (2005). The entropy has units of bits and ranges from
zero for genes expressed in a single tissue to log2(N) for genes
expressed uniformly in all N = 16 tissues considered.
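
A minimal sketch of the entropy measure follows. It assumes that the relative expression of a transcript in each tissue is its expression level divided by the sum over all tissues, which is the usual form of the Shannon entropy; any further normalization applied by Schug et al. (2005) is not reproduced here.

```python
import math


def tissue_entropy(expression):
    """Shannon entropy (in bits) of an expression profile across tissues."""
    total = sum(expression)
    if total <= 0:
        raise ValueError("transcript is not expressed in any tissue")
    entropy = 0.0
    for level in expression:
        if level > 0:  # tissues with zero expression contribute nothing
            p = level / total
            entropy -= p * math.log2(p)
    return entropy


single_tissue = [5.0] + [0.0] * 15   # expressed in one of 16 tissues
uniform = [2.0] * 16                 # expressed equally in all 16 tissues
print(tissue_entropy(single_tissue), tissue_entropy(uniform))
```

The two profiles give the extremes stated above: zero bits for a single tissue and log2(16) = 4 bits for uniform expression.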



Identification of chimeric proteins by evidence in PeptideAtlas
and GPM

To assess which chimeric transcripts are translated into chimeric 
proteins, we sought to identify unique mass spec peptides that 
match the gene-gene junctions of the chimeras. To this end, we 
searched the mass spectra of two publicly available proteomic da- 
tabases, the GPM (Craig et al. 2004) and PeptideAtlas (Deutsch 
et al. 2008; Deutsch 2010), for evidence of such peptides using the 
default options of the X!Tandem search engine (Muth et al. 2010).
We used the GENCODE annotation (version 3C) of the human
genome (Harrow et al. 2006; The ENCODE Project Consortium
2007) as the set of known protein sequences and generated randomized
decoy sets of the same size and composition as the
GENCODE 3C and chimera search sets. We combined the experimental
peptide-spectrum matches (PSMs) found in each individual
experiment in PeptideAtlas and GPM using the P-values generated
by X!Tandem.

The combined P-values were used to rank the PSMs, and the 
simple FDR (number of decoys divided by number of correct 
matches) that could be estimated from the peptides in the decoy 
and GENCODE sets was corrected using a multiple testing method. 
We assigned q-values (the minimal FDR threshold at which a given
peptide is accepted; Käll et al. 2009) to the PSMs. Chimeric transcripts
with expressed peptides corresponding to their gene-gene
junction site confirmed by a PSM below the corrected FDR 
threshold (1% for all the cases confirmed by RNA-seq, or not) were 
considered potential true positives. 
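
The ranking and q-value assignment can be sketched as below. The sketch assumes that each PSM is represented by a hypothetical (combined P-value, is_decoy) pair and uses the simple decoy/target ratio as the FDR estimate; the specific multiple testing correction applied in the study is not reproduced.

```python
def q_values(psms):
    """Assign q-values to PSMs given as (p_value, is_decoy) pairs.

    Returns a list of q-values in the same order as the input.
    """
    order = sorted(range(len(psms)), key=lambda i: psms[i][0])
    fdrs = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        # Simple FDR estimate at this threshold: decoys / target matches.
        fdrs.append(decoys / targets if targets else 1.0)
    # The q-value is the smallest FDR at this or any more permissive threshold.
    q = [0.0] * len(psms)
    running_min = 1.0
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = running_min
    return q


psms = [(0.001, False), (0.002, False), (0.01, True), (0.02, False), (0.05, False)]
qs = q_values(psms)
accepted = [p for p, q in zip(psms, qs) if not p[1] and q <= 0.01]
print(qs, len(accepted))
```

With a 1% threshold, only the two best-scoring target PSMs in this toy list are accepted.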

Shotgun proteomics experiments 

To detect chimeric proteins, we employed "bottom-up" shotgun
proteomics using two-dimensional liquid chromatography coupled
with high-resolution tandem mass spectrometry. The platform
was operated in data independent mode as described in Levin
et al. (2011).

Human cell lines 

Three human cancer cell lines were subjected to proteomic analysis:
the MCF7 human breast epithelial cell line derived from mammary
gland adenocarcinoma (HTB-22), the OVCAR-3 human epithelial
cell line derived from ovary (HTB-161), and the DU-145
human epithelial carcinoma derived from prostate (HTB-81). The 
cells were prepared as explained in the Supplemental Methods. 

Proteome sample preparation 

Proteins in the cell lysates were reduced by addition of dithiothreitol
(Sigma; 5 mM) and incubation for 30 min at 60°C and then
alkylated by addition of iodoacetamide (Sigma; 10 mM) and incubation
in the dark for 30 min at 21°C. The proteins were then
digested by incubation with trypsin (Promega) for 16 h at 37°C,
added at a ratio of 1:50 (w/w trypsin/protein). Digestions were
stopped by the addition of 1% trifluoroacetic acid (TFA). Following
digestion, detergents were removed using the Pierce Detergent 
Removal spin columns according to the manufacturer's procedure. 
The samples were stored at -80°C in aliquots. 

Liquid chromatography-mass spectrometry 

Digested protein (15 μg) from each sample was analyzed by nano-Ultra
Performance Liquid Chromatography (10 kpsi nanoAcquity;
Waters) in high-pH/low-pH reversed phase (RP) two-dimensional
liquid chromatography mode, coupled to high resolution, high
mass accuracy mass spectrometry (Synapt G2 HDMS, Waters). The
quadrupole ion mobility time-of-flight mass spectrometer was
tuned to 20,000 mass resolution (full width at half height). Data
were acquired in HDMSE positive ion mode in data independent
acquisition (for further details see Supplemental Methods).

Bioinformatics procedure 

Raw data processing and database searching were performed using
ProteinLynx Global Server (IdentityE) version 2.5. Database
searching was carried out using the Ion Accounting algorithm 
described by Li et al. (2009a). Briefly, the algorithm detects the 250 
most abundant peptides and performs an initial pass through the 
database in order to identify these peptides (with mass tolerance of 
7 ppm for precursor ions and 15 ppm for fragment ions). These 250 
peptides were used to calibrate 14 predetermined models of spe- 
cific, physicochemical attributes (such as retention time and 
fragmentation prediction, fragment to precursor ratios, etc.). These 
peptides are depleted from the database before the remaining 
peptides are sought in the database. The cycle continues to the 
next most abundant peptides, which are identified and then de- 
pleted from the database. The tentative peptide identifications are 
ranked and scored based on how well they conform to the 14
physicochemical models and reported in a final list. Trypsin was set
as the protease, two missed cleavages were allowed, and fixed 
modification was set to carbamidomethylation of cysteines. Vari- 
able modification included oxidation of methionine. 
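
The mass tolerances quoted above are relative errors in parts per million (ppm). The following sketch shows how such a tolerance check works in general; it illustrates the ppm criterion only and is not part of the Ion Accounting algorithm, whose internals are not described here. The function names and example m/z values are hypothetical.

```python
PRECURSOR_TOL_PPM = 7.0   # tolerance for precursor ions, as stated in the text
FRAGMENT_TOL_PPM = 15.0   # tolerance for fragment ions, as stated in the text


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tolerance_ppm):
    """True if the observed m/z lies within the given ppm tolerance."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tolerance_ppm


# Hypothetical precursor at a theoretical m/z of 500.0000.
print(within_tolerance(500.0030, 500.0, PRECURSOR_TOL_PPM))  # about 6 ppm off
print(within_tolerance(500.0040, 500.0, PRECURSOR_TOL_PPM))  # about 8 ppm off
```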

The data set of combined human Swiss-Prot and all ESTs 
(translated in six frames) from ChimerDB (Kim et al. 2010) was 
employed. The criteria for protein identification were set to a minimum
of three fragments per peptide, five fragments per protein,
and FDR < 1% (Keller et al. 2002; Nesvizhskii et al. 2003). The
peptide score of 6.7 was estimated as a threshold for FDR of 1%
using all sequences from our data set ("target set") versus all reversed
sequences ("decoy set") (Supplemental Fig. S1). Finally,
search results were imported into Scaffold v3.2 for manual in- 
spection and reporting. 

Targeted analysis in selective reaction monitoring mode (SRM)

The liquid chromatography mass spectrometry in SRM mode 
technique is widely used in proteomics for targeted analysis 
(Addona et al. 2009; Stergachis et al. 2011). The peptides were 
synthesized (JPT Peptide Technologies, http://www.jpt.com/) with 
heavy isotopic labels: C terminus R (13C6, 15N4) or C terminus K
(13C6, 15N2) and added to the cell lysates prior to the analysis.

Sample preparation for SRM 

An aliquot was taken from the digested samples outlined in the 
previous section. Samples were diluted to 0.5 μg/μL in 97:3
H2O:ACN + 0.1% TFA.

Liquid chromatography 

ULC/MS grade solvents were used for all chromatographic steps. 
Each sample was loaded using split-less nano-Ultra Performance 
Liquid Chromatography (10 kpsi nanoAcquity; Waters). The mobile
phase was: (A) H2O + 0.1% formic acid and (B) ACN + 0.1%
formic acid. Desalting of samples was performed online using
a reverse-phase C18 trapping column (180 μm i.d., 20 mm length,
5 μm particle size; Waters). The peptides in samples were separated
using a C18 T3 HSS nano-column (75 μm i.d., 150 mm length, 1.8
μm particle size; Waters) run at 0.4 μL/min. Peptides were eluted
from the column and into the mass spectrometer using the
following gradient: 3%-30% B over 40 min, 30%-95% B over 5 min,
maintained at 95% for 7 min and then back to initial conditions.

Mass spectrometry analysis 

The nanoLC was coupled online through a nanoESI emitter (7 cm
length, 10 μm tip; New Objective) to a tandem quadrupole mass
spectrometer (Xevo TQ-S, Waters). Data were acquired in SRM using
MassLynx 4.1. Data were then imported into Skyline (MacLean
et al. 2010a,b) for final processing and evaluation.
Signal-to-noise ratio was calculated by root-mean-square in
MassLynx software (Waters) with no extra processing. Minimum
criteria were 5:1 signal to noise.

Signal peptides and transmembrane domain analysis 

All chimeras translated in six frames were submitted to the SignalP 
3.0 and TMHMM 2.0 servers (Emanuelsson et al. 2007). A chimera 
was considered as transmembrane (TM) if the TMHMM short re- 
port predicted one (or more) TM domain in at least one translated 
frame, and the chimera did not have a predicted signal peptide at 
the overlapping region of the TM domain. The summary outputs 
of SignalP 3.0 and TMHMM 2.0 are available in the Supplemental 
Material (Supplemental Data and http://chimera.bioinfo.cnio.es/). 
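
The six-frame translation that precedes submission to the prediction servers can be sketched as follows. The code is an illustrative implementation using the standard genetic code, not the script used in the study; stop codons are written as "*" and codons containing ambiguous bases as "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return translations of the three forward and three reverse frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name in sorted(frames):
    print(name, frames[name])
```

Each of the six protein sequences would then be submitted separately, and a chimera is scored as TM if any frame satisfies the criterion described above.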

Data access 

The raw spectra identified for the three human cell lines and the 
Skyline projects that contain the SRM traces have been uploaded to 
Tranche (www.proteomecommons.org, Access password: chimera), 
using the hashes listed in the Supplemental Methods; for the cell 
lines: prostate cancer (HTB-81), breast cancer (HTB-22), ovary can- 
cer (HTB-161). All the sequences of the chimeric RNAs presented in 
this study are ESTs or mRNAs from GenBank (http://www.ncbi.nlm.nih.gov/genbank)
and can be found using the EST IDs listed for all the
chimeras in the manuscript.